In [ ]:
from IPython.display import Image

Introduction to NumPy and Pandas

Introduction to NumPy

  • most fundamental third-party package for scientific computing in Python
  • multidimensional array data structures
  • associated functions and methods to manipulate them.
  • Other third-party packages, including pandas, use NumPy arrays as backends for more specialized data structures

Comparison to Python

  • While Python comes with several container types (list, tuple, dict), NumPy's arrays are implemented closer to the hardware and are therefore more efficient than the built-in types.
  • This is particularly true for large data, for which NumPy scales much better than Python's built-in data structures.
  • NumPy arrays also come with a suite of associated functions and methods that allow for efficient array-oriented computing.

Import Convention

  • By convention, numpy is imported under the alias np

In [ ]:
import numpy as np

NumPy Arrays and Indexing

  • You can index an array in the same way you can index Python lists using slice notation

In [ ]:
lst = list(range(1000))
arr = np.arange(1000)

Here's what the array looks like


In [ ]:
arr[:10]

In [ ]:
arr[10:20]

In [ ]:
arr[10:20:2]

In [ ]:
type(arr)

In [ ]:
%timeit [i ** 2 for i in lst]

In [ ]:
%timeit arr ** 2

We can index arrays in the same ways as lists


In [ ]:
arr[5:10]

In [ ]:
arr[-1]

Arrays vs Lists

  • arrays are homogeneously typed
    • all elements of an array must be of the same type.
    • we see why when we think about the memory layout
  • lists can contain elements of arbitrary type

In [ ]:
['a', 2, (1, 3)]

In [ ]:
lst[0] = 'some other type'

In [ ]:
lst[:3]
  • We can't do this with an array: the assignment below raises a ValueError

In [ ]:
arr[0] = 'some other type'
  • The data type is contained in the dtype attribute

In [ ]:
arr.dtype
  • The dtype is fixed
  • Values of other types are cast to this type when possible, so the float below is truncated to an integer

In [ ]:
arr[0] = 1.234

In [ ]:
arr[:10]
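
A minimal sketch of the casting behaviour described above, using small arrays of our own (the names ints and floats are illustrative, not part of the notebook). Assigning a float into an integer array drops the fractional part; astype gives a float copy when that is not what you want.

```python
import numpy as np

ints = np.arange(5)            # integer dtype by default
ints[0] = 1.9                  # cast to the array's dtype: the fraction is dropped
print(ints.dtype, ints[0])

floats = ints.astype(float)    # astype returns a new array with the requested dtype
floats[0] = 1.9                # now the value is stored as given
print(floats.dtype, floats[0])
```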

What is an Array

  • Sometimes it's useful to peek under the hood to fix ideas
  • An array is a block of memory with some extra information on how to interpret its contents

In [ ]:
Image("https://docs.scipy.org/doc/numpy/_images/threefundamental.png")
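
The "extra information" can be inspected directly on any array. The sketch below assumes a small int64 array named buf (an illustrative name) and prints the attributes that tell NumPy how to read the underlying memory.

```python
import numpy as np

buf = np.arange(6, dtype=np.int64)

print(buf.dtype)      # how the bytes of each element are interpreted
print(buf.itemsize)   # bytes per element (8 for int64)
print(buf.shape)      # number of elements along each dimension
print(buf.strides)    # bytes to step over to reach the next element
print(buf.nbytes)     # total size of the data block: 6 elements * 8 bytes
```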

Array Creation


In [ ]:
np.zeros(5, dtype=float)

In [ ]:
np.zeros(5, dtype=int)

In [ ]:
np.zeros(5, dtype=complex)

In [ ]:
np.ones(5, dtype=float)
  • We have seen how the arange function generates an array for a range of integers.
  • The linspace and logspace functions create linearly and logarithmically spaced grids respectively, with a fixed number of points and including both ends of the specified interval:

In [ ]:
np.linspace(0, 1, num=5)

In [ ]:
np.logspace(1, 4, num=4)
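
As a quick check of how the two functions relate, the sketch below (variable names lin and log are our own) confirms that both endpoints are included and that logspace is 10 raised to the corresponding linspace grid.

```python
import numpy as np

lin = np.linspace(0, 1, num=5)    # 5 evenly spaced points from 0 to 1 inclusive
log = np.logspace(1, 4, num=4)    # 4 points from 10**1 to 10**4 inclusive

print(lin)
print(log)

# logspace(a, b, n) matches 10 ** linspace(a, b, n)
print(np.allclose(log, 10 ** np.linspace(1, 4, num=4)))
```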

Random Number Generation

Finally, it is often useful to create arrays with random numbers that follow a specific distribution. The np.random module contains a number of functions that can be used to this effect. For example, the following produces an array of 5 random samples taken from a standard normal distribution (mean 0 and variance 1), $ X \sim N(0, 1) $:

$$f(x \mid \mu=0, \sigma=1) = \sqrt{\frac{1}{2\pi \sigma^2}} \exp\left\{ -\frac{x^2}{2\sigma^2} \right\}$$


In [ ]:
np.random.randn(5)

$X \sim N(9, 3^2)$ (mean 9, standard deviation 3)


In [ ]:
norm10 = np.random.normal(loc=9, scale=3, size=10)
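
Random draws differ on every run, which makes results hard to reproduce. One option, sketched here with an arbitrarily chosen seed, is NumPy's Generator interface (np.random.default_rng, available in NumPy 1.17 and later). With a large sample, the sample mean and standard deviation should land close to loc and scale.

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # the seed value is an arbitrary choice
draws = rng.normal(loc=9, scale=3, size=100_000)

print(draws.mean())   # close to loc
print(draws.std())    # close to scale (the standard deviation, not the variance)
```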

Exercise: Random numbers

Generate a NumPy array of 1000 random numbers sampled from a Poisson distribution, with parameter lam=5. What is the modal value in the sample?


In [ ]:
%load solutions/random_number.py

Index Arrays

  • Above we showed how to index with numbers and slices
  • NumPy indexing is much more powerful than Python indexing
  • You can index with other arrays
    • Boolean arrays
    • Integer arrays

Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is True or False:

Boolean Indexing


In [ ]:
mask = norm10 > 9
mask

In [ ]:
norm10[mask]

Integer Indexing

  • Likewise you can index with integer arrays

In [ ]:
norm10[[1, 4, 6]]

Assignment

  • This form of indexing is known as fancy-indexing
  • You can use fancy-indexing for assignment
    • This is particularly useful for assignment given some condition

In [ ]:
norm10[norm10 > 9] = 0

In [ ]:
norm10

In [ ]:
norm10[[1, 4, 7]] = 10

In [ ]:
norm10
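
Because norm10 is random, its output changes from run to run. The same two assignments are sketched below on a small fixed array (values is an illustrative name) so the result can be followed by eye.

```python
import numpy as np

values = np.array([3.0, 12.0, 7.0, 10.0, 9.5])

mask = values > 9          # boolean array, True where the condition holds
print(mask)

values[mask] = 0           # boolean fancy-indexing on the left-hand side
print(values)              # entries that were above 9 are now 0

values[[0, 2]] = -1        # integer fancy-indexing on the left-hand side
print(values)              # positions 0 and 2 are now -1
```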

Copies vs Views

  • This is a common gotcha for people new to NumPy
  • Fancy-indexing on the left-hand side of an assignment does not copy; it writes into the original array
    • Just __setitem__
  • Fancy-indexing on the right-hand side produces a copy, not a view
    • Assigning to that result is __getitem__ followed by __setitem__ on the copy
  • When we use slice notation to look at part of an array, it produces a view
  • That is, it points to the same memory as the original array

In [ ]:
x = np.arange(10)

In [ ]:
x

In [ ]:
y = x[::2]
y

In [ ]:
y[3] = 100
y

In [ ]:
x
  • Fancy-indexing on the right-hand side, however, produces a copy
  • Operating on the copy will not affect the original array

In [ ]:
a = norm10[[0, 1, 5]]

In [ ]:
a

In [ ]:
a[:] = -10

In [ ]:
a

In [ ]:
norm10
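
When it is unclear whether an expression gave a view or a copy, np.shares_memory can tell you. A small sketch, with variable names of our own choosing:

```python
import numpy as np

x = np.arange(10)

view = x[::2]               # slicing: a view onto x's memory
fancy = x[[0, 2, 4]]        # fancy-indexing: a fresh copy
explicit = x[::2].copy()    # an explicit copy of a slice

print(np.shares_memory(x, view))       # True
print(np.shares_memory(x, fancy))      # False
print(np.shares_memory(x, explicit))   # False
```

Calling .copy() on a slice is the usual way to get an independent array when you do not want changes to propagate back.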

Exercise

Create an array [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] without typing the values by hand. Assign 100 to elements 2 to 5 (zero-index). Print the array.

Create the same array as in step one above. Create an array from a slice of elements 2 to 5. Assign 100 to the slice. Hint: try [:] to address all of the elements of an array. Print the original array and the slice.


In [ ]:
# [Solution here]

In [ ]:
%load solutions/copies_vs_views.py

Multidimensional Arrays

  • NumPy can create arrays of arbitrary dimensions, and all the methods illustrated in the previous section work with more than one dimension.
  • For example, a list of lists can be used to initialize a two dimensional array:

In [ ]:
samples_list = [[632, 1638, 569, 115], [433,1130,754,555]]
samples_array = np.array(samples_list)
samples_array.shape

In [ ]:
print(samples_array)

With two-dimensional arrays we start seeing the convenience of NumPy data structures: while a nested list can be indexed across dimensions using consecutive [ ] operators, multidimensional arrays support a more natural indexing syntax with a single set of brackets and a set of comma-separated indices:


In [ ]:
samples_list[0][1]

In [ ]:
samples_array[0,1]

Most of the array creation functions listed above can be passed multidimensional shapes. For example:


In [ ]:
np.zeros((2,3))

In [ ]:
np.random.normal(10, 3, size=(2, 4))

In fact, an array can be reshaped at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is via the array's reshape method.


In [ ]:
arr = np.arange(8).reshape(2,4)
arr
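
A short sketch of the "total number of elements is unchanged" rule, using our own names flat and grid. Passing -1 lets NumPy infer one dimension, a mismatched shape raises ValueError, and for a contiguous array like this one the reshaped result is a view onto the same data.

```python
import numpy as np

flat = np.arange(8)

grid = flat.reshape(2, -1)    # -1 asks NumPy to infer the remaining dimension
print(grid.shape)             # (2, 4)

grid[0, 0] = 99               # grid is a view here, so flat changes too
print(flat[0])

try:
    flat.reshape(3, 3)        # 9 slots cannot hold 8 elements
except ValueError as err:
    print('reshape failed:', err)
```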

With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions (using the same array as above):


In [ ]:
arr[1, 2:4]

In [ ]:
arr[:, 2]

If you only provide one index, then you will get the corresponding row.


In [ ]:
arr[1]

Now that we have seen how to create arrays with more than one dimension, it's a good idea to look at some of the most useful properties and methods that arrays have. The following provide basic information about the size, shape and data in the array:


In [ ]:
print('Data type                :', samples_array.dtype)
print('Total number of elements :', samples_array.size)
print('Number of dimensions     :', samples_array.ndim)
print('Shape (dimensionality)   :', samples_array.shape)
print('Memory used (in bytes)   :', samples_array.nbytes)

Arrays also have many useful methods, some especially useful ones are:


In [ ]:
print('Minimum and maximum             :', samples_array.min(), samples_array.max())
print('Sum, mean and standard deviation:', samples_array.sum(), samples_array.mean(), samples_array.std())

For these methods, the above operations are all computed on all the elements of the array. But for a multidimensional array, it's possible to do the computation along a single dimension, by passing the axis parameter; for example:


In [ ]:
samples_array.sum(axis=0)

In [ ]:
samples_array.sum(axis=1)
  • Notice that summing along axis 1 (across each row) returned a 1d array above.
  • If you want to preserve the dimensions use the keepdims keyword

In [ ]:
samples_array.sum(axis=1, keepdims=True)
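
To make the effect of axis and keepdims concrete, the sketch below rebuilds the same 2x4 data under the illustrative name counts and compares result shapes. Keeping the dimension is handy when the totals are combined with the original array again, as in the row-proportion step at the end (which relies on broadcasting, covered later in this notebook).

```python
import numpy as np

counts = np.array([[632, 1638, 569, 115],
                   [433, 1130, 754, 555]])

print(counts.sum(axis=0).shape)    # (4,): one total per column
print(counts.sum(axis=1).shape)    # (2,): one total per row

row_totals = counts.sum(axis=1, keepdims=True)
print(row_totals.shape)            # (2, 1): the summed axis is kept with size 1

proportions = counts / row_totals  # each row divided by its own total
print(proportions.sum(axis=1))     # each row now sums to 1 (up to rounding)
```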

Another widely used property of arrays is the .T attribute, which allows you to access the transpose of the array:


In [ ]:
samples_array.T

There is a wide variety of methods and properties of arrays.


In [ ]:
[attr for attr in dir(samples_array) if not attr.startswith('__')]

What is a Multi-Dimensional Array

  • memory is a linear address space
  • by adding information on shape and strides we can interpret bytes laid out linearly in memory as a multidimensional object

In [ ]:
Image('https://ipython-books.github.io/images/layout.png')
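
The shape-and-strides idea can be checked on a small array. In this sketch (grid is our own name, and int64 is chosen so each element is 8 bytes), the transpose reuses the same memory and simply swaps the strides.

```python
import numpy as np

grid = np.arange(8, dtype=np.int64).reshape(2, 4)

# Moving one row down skips 4 elements * 8 bytes; one column right skips 8 bytes
print(grid.shape, grid.strides)        # (2, 4) (32, 8)

# The transpose is the same memory read with swapped shape and strides
print(grid.T.shape, grid.T.strides)    # (4, 2) (8, 32)
print(np.shares_memory(grid, grid.T))  # True
```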

Exercises: Matrix Creation

Generate the following structure as a numpy array, without typing the values by hand. Then, create another array containing just the 2nd and 4th rows.

    [[1,  6, 11],
     [2,  7, 12],
     [3,  8, 13],
     [4,  9, 14],
     [5, 10, 15]]

In [ ]:
%load solutions/matrix_creation.py

Array Operations, Methods, and Functions


In [ ]:
sample1 = np.array([632, 1638, 569, 115])
sample2 = np.array([433,1130,754,555])

sample_sum = sample1 + sample2

In [ ]:
np.array([632, 1638, 569, 115])

Arithmetic operators such as + act elementwise on arrays. This includes the multiplication operator: unlike in Matlab, for example, it does not perform matrix multiplication:


In [ ]:
print('{0} X {1} = {2}'.format(sample1, sample2, sample1 * sample2))

In Python 3.5 and later, you can use the @ operator to get the inner product (or matrix multiplication):


In [ ]:
print('{0} . {1} = {2}'.format(sample1, sample2, sample1 @ sample2))
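
The difference between * and @ is easiest to see on numbers small enough to check by hand. A sketch with illustrative arrays u, v and m:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(u * v)    # elementwise products: 4, 10, 18
print(u @ v)    # inner product: 4 + 10 + 18 = 32

m = np.array([[1, 2],
              [3, 4]])

print(m * m)    # elementwise squares: [[1, 4], [9, 16]]
print(m @ m)    # matrix product: [[7, 10], [15, 22]]
```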
  • Elementwise operations require that the dimensions of the arrays match in size
  • When they don't, numpy will broadcast dimensions when possible
  • For example, suppose that you want to add the number 1.5 to each element of sample1
  • We achieve this by broadcasting

In [ ]:
sample1 + 1.5

In this case, numpy looked at both operands and saw that the first was a one-dimensional array of length 4 and the second was a scalar, considered a zero-dimensional object. The broadcasting rules allow numpy to:

  • create a new array of length 1
  • extend the array to match the size of the corresponding array

So in the above example, the scalar 1.5 is effectively cast to a 1-dimensional array of length 1, then stretched to length 4 to match the dimension of sample1. After this, element-wise addition can proceed, as both operands are now one-dimensional arrays of length 4.

This broadcasting behavior is powerful, especially because when NumPy broadcasts to create new dimensions or to stretch existing ones, it doesn't actually replicate the data. In the example above the operation is carried as if the 1.5 was a 1-d array with 1.5 in all of its entries, but no actual array was ever created. This saves memory and improves the performance of operations.
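
One way to see that no data is replicated is np.broadcast_to, which returns the stretched operand explicitly as a read-only view. In this sketch (stretched is our own name) the stride of 0 means every position reads the same single value in memory.

```python
import numpy as np

stretched = np.broadcast_to(1.5, (4,))

print(stretched)            # looks like four copies of 1.5
print(stretched.shape)      # (4,)
print(stretched.strides)    # (0,): stepping to the next element moves 0 bytes
```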

When broadcasting, NumPy compares the sizes of each dimension in each operand. It starts with the trailing dimensions, working forward and creating dimensions as needed to accommodate the operation. Two dimensions are considered compatible for operation when:

  • they are equal in size
  • one is scalar (or size 1)

If these conditions are not met, an exception is thrown, indicating that the arrays have incompatible shapes.


In [ ]:
sample1 + np.array([7,8])

In [ ]:
b = np.array([10, 20, 30, 40])

bcast_sum = sample1 + b

In [ ]:
print('{0}\n\n+ {1}\n{2}\n{3}'.format(sample1, b, '-'*21, bcast_sum))

In [ ]:
c = np.array([-100, 100])
sample1 + c

Remember that matching begins at the trailing dimensions. Here, c would need to have a trailing dimension of 1 for the broadcasting to work. We can augment arrays with dimensions on the fly by indexing them with an np.newaxis object, which adds an "empty" dimension:


In [ ]:
cplus = c[:, np.newaxis]
cplus

In [ ]:
cplus.shape

In [ ]:
sample1 + cplus

In [ ]:
sample1[:, np.newaxis] + c
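
The compatibility rules can be tried on shapes alone, without building any arrays. This sketch assumes NumPy 1.20 or later, where np.broadcast_shapes is available; the shapes mirror the sample1 and c examples above.

```python
import numpy as np

# (4,) against (2, 1): trailing sizes 4 and 1 are compatible, result is 2x4
print(np.broadcast_shapes((4,), (2, 1)))

# (4, 1) against (2,): trailing sizes 1 and 2 are compatible, result is 4x2
print(np.broadcast_shapes((4, 1), (2,)))

# (4,) against (2,): trailing sizes 4 and 2 differ and neither is 1
try:
    np.broadcast_shapes((4,), (2,))
except ValueError as err:
    print('incompatible shapes:', err)
```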

Exercises: Array Manipulation

Divide each column of the array:

a = np.arange(25).reshape(5, 5)

elementwise with the array

b = np.array([1., 5, 10, 15, 20])

In [ ]:
# [Solution here]

In [ ]:
%load solutions/broadcasting.py

What Else

  • NumPy provides much more functionality than what we covered here
  • For example, facilities for linear algebra, FFTs, polynomials, and unit testing for floating point

Introduction to Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as you might find in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data with row and column labels

Virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.

Key features

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
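
Two of the features listed above, automatic alignment and missing-data handling, can be illustrated in a few lines. The Series below (first and second) are made-up examples with partly overlapping labels.

```python
import pandas as pd

first = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
second = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Addition aligns on labels; labels present in only one operand give NaN
total = first + second
print(total)

# Missing values can be dropped ...
print(total.dropna())

# ... or avoided by treating a missing operand as 0
filled = first.add(second, fill_value=0)
print(filled)
```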

Import Convention


In [ ]:
import pandas as pd

Pandas Series

  • A pandas Series is a generalization of a 1d numpy array
  • A Series has an index that labels each element in the vector.
  • A Series can be thought of as an ordered key-value store.

In [ ]:
counts = pd.Series([632, 1638, 569, 115])
counts

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series, while the index is a pandas Index object.


In [ ]:
counts.values

Index Object

Pandas provides a labeled index to access the rows


In [ ]:
counts.index

We can assign meaningful labels to the index, if they are available:


In [ ]:
bacteria = pd.Series([632, 1638, 569, 115], 
                     index=['Firmicutes', 'Proteobacteria', 
                            'Actinobacteria', 'Bacteroidetes'])

bacteria
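
With labels in place, elements can be looked up by name rather than by position. A brief sketch that rebuilds the same Series so it runs on its own:

```python
import pandas as pd

bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria',
                            'Actinobacteria', 'Bacteroidetes'])

# Look up a single value by its label, like a key in a dict
print(bacteria['Actinobacteria'])

# Select several labels at once with .loc
print(bacteria.loc[['Firmicutes', 'Bacteroidetes']])

# Boolean indexing works as it does for NumPy arrays, and keeps the labels
print(bacteria[bacteria > 600])
```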

NumPy's math functions and other operations can be applied to Series without losing the data structure.


In [ ]:
np.log(bacteria)
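  • A Series can also be created from a dict
  • Older versions of pandas returned the entries in key-sorted order; pandas 0.23 and later (on Python 3.6+) keep the dict's insertion order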

In [ ]:
bacteria_dict = {
    'Firmicutes': 632, 
    'Proteobacteria': 1638,
    'Actinobacteria': 569, 
    'Bacteroidetes': 115
}

pd.Series(bacteria_dict)
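
Since the ordering of a dict-built Series depends on the pandas version, calling sort_index is a simple way to get key-sorted order explicitly. A sketch using the same data under the illustrative name from_dict:

```python
import pandas as pd

bacteria_dict = {
    'Firmicutes': 632,
    'Proteobacteria': 1638,
    'Actinobacteria': 569,
    'Bacteroidetes': 115
}

from_dict = pd.Series(bacteria_dict)
print(list(from_dict.index))          # order as constructed (version dependent)

by_key = from_dict.sort_index()       # explicit alphabetical order by label
print(list(by_key.index))
```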

Pandas DataFrames

Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet.


In [ ]:
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                                'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes']})
data
  • We will often want to peek at the first few rows of a DataFrame
  • You can use head to do this

In [ ]:
data.head()

Columns as an Index

The second axis (axis 1) of a DataFrame also has an index, which holds the column labels


In [ ]:
data.columns
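
The column index is what lets you pull out a single column by name, and each column comes back as a Series with its own dtype. A sketch on a two-row excerpt of the data above (frame is our own name):

```python
import pandas as pd

frame = pd.DataFrame({'value': [632, 1638],
                      'patient': [1, 1],
                      'phylum': ['Firmicutes', 'Proteobacteria']})

print(frame.shape)                     # (rows, columns)
print(list(frame.columns))             # the column labels

value = frame['value']                 # selecting by label returns a Series
print(type(value).__name__)

print(frame.dtypes)                    # one dtype per column
```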

Reading and Writing Files

  • Pandas provides sophisticated I/O functionality
  • read_csv is a highly optimized csv reader

In [ ]:
vessels = pd.read_csv("../data/AIS/vessel_information.csv")
vessels.head()
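
If the data file is not at hand, read_csv and to_csv can be tried on an in-memory buffer instead. The sketch below uses io.StringIO with made-up columns (vessel_id, name, length are hypothetical and not taken from vessel_information.csv) and checks that a write followed by a read gives back the same table.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = "vessel_id,name,length\n1,Alpha,30.5\n2,Bravo,42.0\n"

frame = pd.read_csv(io.StringIO(csv_text))
print(frame)

# Write back out without the row index, then read it in again
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(round_trip.equals(frame))
```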

Exercises

  • Read a single file ../data/NationalFoodSurvey/NFS_1974.csv

In [ ]:
%load solutions/read_nfs_1974.py